The textbook for the Data Science course series is freely available online.
There are three major sections in this course: introduction to linear regression, linear models, and confounding.
In this section, you’ll learn the basics of linear regression through this course’s motivating example, the data-driven approach used to construct baseball teams. You’ll also learn about correlation, the correlation coefficient, stratification, and the variance explained.
In this section, you’ll learn about linear models. You’ll learn about least squares estimates, multivariate regression, and several useful features of R, such as tibbles, lm, do, and broom. You’ll learn how to apply regression to baseball to build a better offensive metric.
In the final section of the course, you’ll learn about confounding and several reasons that correlation is not the same as causation, such as spurious correlation, outliers, reversing cause and effect, and confounders. You’ll also learn about Simpson’s Paradox.
In the Introduction to Regression section, you will learn the basics of linear regression.
After completing this section, you will be able to:
This section has three parts: Baseball as a Motivating Example, Correlation, and Stratification and Variance Explained.
The corresponding section of the textbook is the case study on Moneyball
Key points
Bill James was the originator of the sabermetrics, the approach of using data to predict what outcomes best predicted if a team would win.
The corresponding section of the textbook is the section on baseball basics
Key points
The corresponding section of the textbook is the base on balls or stolen bases textbook section
Key points
The visualization of choice when exploring the relationship between two variables like home runs and runs is a scatterplot.
Code: Scatterplot of the relationship between HRs and runs
if(!require(Lahman)) install.packages("Lahman")
## Loading required package: Lahman
if(!require(tidyverse)) install.packages("tidyverse")
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.1
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if(!require(dslabs)) install.packages("dslabs")
## Loading required package: dslabs
library(Lahman)
library(tidyverse)
library(dslabs)
ds_theme_set()
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_per_game = HR / G, R_per_game = R / G) %>%
ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between stolen bases and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(SB_per_game = SB / G, R_per_game = R / G) %>%
ggplot(aes(SB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Code: Scatterplot of the relationship between bases on balls and runs
Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_per_game = BB / G, R_per_game = R / G) %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
ggplot(aes(AB, R)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_line()
Teams %>% filter(yearID %in% 1961:2001 ) %>%
mutate(AB_per_game = AB/G, R_per_game = R/G) %>%
ggplot(aes(R_per_game, AB_per_game)) +
geom_point()
Hint: make sure to use the help file (?Teams).
Teams data frame to include years from 1961 to 2001. Make a scatterplot of runs per game versus at bats (AB) per game.Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(AB_per_game = AB / G, R_per_game = R / G) %>%
ggplot(aes(AB_per_game, R_per_game)) +
geom_point(alpha = 0.5)
Which of the following is true?
Teams data frame from Question 6. Make a scatterplot of win rate (number of wins per game) versus number of fielding errors (E) per game.Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(win_rate = W / G, E_per_game = E / G) %>%
ggplot(aes(win_rate, E_per_game)) +
geom_point(alpha = 0.5)
Which of the following is true?
Teams data frame from Question 6. Make a scatterplot of triples (X3B) per game versus doubles (X2B) per game.Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(doubles_per_game = X2B / G, triples_per_game = X3B / G) %>%
ggplot(aes(doubles_per_game, triples_per_game)) +
geom_point(alpha = 0.5)
Which of the following is true?
The corresponding textbook section is Case Study: is height hereditary?
Key points
Code
# create the dataset
if(!require(HistData)) install.packages("HistData")
## Loading required package: HistData
library(tidyverse)
library(HistData)
data("GaltonFamilies")
set.seed(1983)
galton_heights <- GaltonFamilies %>%
filter(gender == "male") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(father, childHeight) %>%
rename(son = childHeight)
# means and standard deviations
galton_heights %>%
summarize(mean(father), sd(father), mean(son), sd(son))
## # A tibble: 1 x 4
## `mean(father)` `sd(father)` `mean(son)` `sd(son)`
## <dbl> <dbl> <dbl> <dbl>
## 1 69.1 2.55 69.2 2.71
# scatterplot of father and son heights
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5)
The corresponding textbook section is the correlation coefficient
Key points
Code
rho <- mean(scale(x)*scale(y))
data("GaltonFamilies")
galton_heights <- GaltonFamilies %>% filter(childNum == 1 & gender == "male") %>% select(father, childHeight) %>% rename(son = childHeight)
galton_heights %>% summarize(r = cor(father, son)) %>% pull(r)
## [1] 0.5007248
The corresponding textbook section is Sample correlation is a random variable
Key points
Code
# compute sample correlation
R <- sample_n(galton_heights, 25, replace = TRUE) %>%
summarize(r = cor(father, son))
R
## r
## 1 0.4787613
# Monte Carlo simulation to show distribution of sample correlation
B <- 1000
N <- 25
R <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
summarize(r = cor(father, son)) %>%
pull(r)
})
qplot(R, geom = "histogram", binwidth = 0.05, color = I("black"))
# expected value and standard error
mean(R)
## [1] 0.4970997
sd(R)
## [1] 0.1512451
# QQ-plot to evaluate whether N is large enough
data.frame(R) %>%
ggplot(aes(sample = R)) +
stat_qq() +
geom_abline(intercept = mean(R), slope = sqrt((1-mean(R)^2)/(N-2)))
Scatter plot relationship x and y
From this figure, the correlation between x and y appears to be about:
Would you expect the mean of our sample correlation to increase, decrease, or stay approximately the same?
Would you expect the standard deviation of our sample correlation to increase, decrease, or stay approximately the same?
Teams data frame to include years from 1961 to 2001.What is the correlation coefficient between number of runs per game and number of at bats per game?
library(Lahman)
Teams_small <- Teams %>% filter(yearID %in% 1961:2001)
cor(Teams_small$R/Teams_small$G, Teams_small$AB/Teams_small$G)
## [1] 0.6580976
Teams data frame from Question 7.What is the correlation coefficient between win rate (number of wins per game) and number of errors per game?
cor(Teams_small$W/Teams_small$G, Teams_small$E/Teams_small$G)
## [1] -0.3396947
Teams data frame from Question 7.What is the correlation coefficient between doubles (X2B) per game and triples (X3B) per game?
cor(Teams_small$X2B/Teams_small$G, Teams_small$X3B/Teams_small$G)
## [1] -0.01157404
There are three links to relevant sections of the textbook:
Key points
Code
# number of fathers with height 72 or 72.5 inches
sum(galton_heights$father == 72)
## [1] 8
sum(galton_heights$father == 72.5)
## [1] 1
# predicted height of a son with a 72 inch tall father
conditional_avg <- galton_heights %>%
filter(round(father) == 72) %>%
summarize(avg = mean(son)) %>%
pull(avg)
conditional_avg
## [1] 71.83571
# stratify fathers' heights to make a boxplot of son heights
galton_heights %>% mutate(father_strata = factor(round(father))) %>%
ggplot(aes(father_strata, son)) +
geom_boxplot() +
geom_point()
# center of each boxplot
galton_heights %>%
mutate(father = round(father)) %>%
group_by(father) %>%
summarize(son_conditional_avg = mean(son)) %>%
ggplot(aes(father, son_conditional_avg)) +
geom_point()
## `summarise()` ungrouping output (override with `.groups` argument)
# calculate values to plot regression line on original data
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m <- r * s_y/s_x
b <- mu_y - m*mu_x
# add regression line to plot
galton_heights %>%
ggplot(aes(father, son)) +
geom_point(alpha = 0.5) +
geom_abline(intercept = b, slope = m)
There is a link to the relevant section of the textbook: Bivariate normal distribution (advanced)
Key points
Code
galton_heights %>%
mutate(z_father = round((father - mean(father)) / sd(father))) %>%
filter(z_father %in% -2:2) %>%
ggplot() +
stat_qq(aes(sample = son)) +
facet_wrap( ~ z_father)
There is a link to the relevant section of the textbook: Variance explained
Key points
There is a link to the relevant section of the textbook: Warning: there are two regression lines
Key point
There are two different regression lines depending on whether we are taking the expectation of Y given X or taking the expectation of X given Y.
Code
# compute a regression line to predict the son's height from the father's height
mu_x <- mean(galton_heights$father)
mu_y <- mean(galton_heights$son)
s_x <- sd(galton_heights$father)
s_y <- sd(galton_heights$son)
r <- cor(galton_heights$father, galton_heights$son)
m_1 <- r * s_y / s_x
b_1 <- mu_y - m_1*mu_x
m_1 # slope 1
## [1] 0.5027904
b_1 # intercept 1
## [1] 35.71249
# compute a regression line to predict the father's height from the son's height
m_2 <- r * s_x / s_y
b_2 <- mu_x - m_2*mu_y
m_2 # slope 2
## [1] 0.4986676
b_2 # intercept 2
## [1] 33.96539
Scatter plot and regression line of son and father heights
Given this, what percent of the variation in sons’ heights is explained by fathers’ heights?
When two variables follow a bivariate normal distribution, the variation explained can be calculated as \(\rho^2 \times 100\).
Given a one inch increase in a father’s height, what is the predicted change in the son’s height?
The slope of the regression line is calculated by multiplying the correlation coefficient by the ratio of the standard deviation of son heights and standard deviation of father heights: \(\sigma_{son}/\sigma_{father}\).
In the second part of this assessment, you’ll analyze a set of mother and daughter heights, also from GaltonFamilies.
Define female_heights, a set of mother and daughter heights sampled from GaltonFamilies, as follows:
set.seed(1989, sample.kind="Rounding") #if you are using R 3.6 or later
## Warning in set.seed(1989, sample.kind = "Rounding"): non-uniform 'Rounding'
## sampler used
library(HistData)
data("GaltonFamilies")
female_heights <- GaltonFamilies%>%
filter(gender == "female") %>%
group_by(family) %>%
sample_n(1) %>%
ungroup() %>%
select(mother, childHeight) %>%
rename(daughter = childHeight)
Mean of mothers’ heights
mean(female_heights$mother)
## [1] 64.125
Standard deviation of mothers’ heights
sd(female_heights$mother)
## [1] 2.289292
Mean of daughters’ heights
mean(female_heights$daughter)
## [1] 64.28011
Standard deviation of daughters’ heights
sd(female_heights$daughter)
## [1] 2.39416
Correlation coefficient
cor(female_heights$mother, female_heights$daughter)
## [1] 0.3245199
Slope of regression line predicting daughters’ height from mothers’ heights
r <- cor(female_heights$mother, female_heights$daughter)
s_y <- sd(female_heights$daughter)
s_x <- sd(female_heights$mother)
r * s_y/s_x
## [1] 0.3393856
Intercept of regression line predicting daughters’ height from mothers’ heights
mu_y <- mean(female_heights$daughter)
mu_x <- mean(female_heights$mother)
mu_y - (r * s_y/s_x)*mu_x
## [1] 42.51701
Change in daughter’s height in inches given a 1 inch increase in the mother’s height
r * s_y/s_x
## [1] 0.3393856
r^2*100
## [1] 10.53132
What is the conditional expected value of her daughter’s height given the mother’s height?
m = r * s_y/s_x
b = mu_y - (r * s_y/s_x)*mu_x
x = 60
m*x+b
## [1] 62.88015
In the Linear Models section, you will learn how to do linear regression.
After completing this section, you will be able to:
do() function to bridge R functions and the tidyverse.tidy(), glance(), and augment() functions from the broom package.This section has four parts: Introduction to Linear Models, Least Squares Estimates, Tibbles, do, and broom, and Regression and Baseball.
The textbook for this section is available here
Key points
Code
# find regression line for predicting runs from BBs
bb_slope <- Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(BB_per_game = BB/G, R_per_game = R/G) %>%
lm(R_per_game ~ BB_per_game, data = .) %>%
.$coef %>%
.[2]
bb_slope
## BB_per_game
## 0.7353288
# compute regression line for predicting runs from singles
singles_slope <- Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(Singles_per_game = (H-HR-X2B-X3B)/G, R_per_game = R/G) %>%
lm(R_per_game ~ Singles_per_game, data = .) %>%
.$coef %>%
.[2]
singles_slope
## Singles_per_game
## 0.4494253
# calculate correlation between HR, BB and singles
Teams %>%
filter(yearID %in% 1961:2001 ) %>%
mutate(Singles = (H-HR-X2B-X3B)/G, BB = BB/G, HR = HR/G) %>%
summarize(cor(BB, HR), cor(Singles, HR), cor(BB,Singles))
## cor(BB, HR) cor(Singles, HR) cor(BB, Singles)
## 1 0.4039313 -0.1737435 -0.05603822
The textbook for this section is available here
Key points
Code
# stratify HR per game to nearest 10, filter out strata with few points
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR_strata = round(HR/G, 1),
BB_per_game = BB / G,
R_per_game = R / G) %>%
filter(HR_strata >= 0.4 & HR_strata <=1.2)
# scatterplot for each HR stratum
dat %>%
ggplot(aes(BB_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ HR_strata)
## `geom_smooth()` using formula 'y ~ x'
# calculate slope of regression line after stratifying by HR
dat %>%
group_by(HR_strata) %>%
summarize(slope = cor(BB_per_game, R_per_game)*sd(R_per_game)/sd(BB_per_game))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 2
## HR_strata slope
## <dbl> <dbl>
## 1 0.4 0.734
## 2 0.5 0.566
## 3 0.6 0.412
## 4 0.7 0.285
## 5 0.8 0.365
## 6 0.9 0.261
## 7 1 0.512
## 8 1.1 0.454
## 9 1.2 0.440
# stratify by BB
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(BB_strata = round(BB/G, 1),
HR_per_game = HR / G,
R_per_game = R / G) %>%
filter(BB_strata >= 2.8 & BB_strata <=3.9)
# scatterplot for each BB stratum
dat %>% ggplot(aes(HR_per_game, R_per_game)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm") +
facet_wrap( ~ BB_strata)
## `geom_smooth()` using formula 'y ~ x'
# slope of regression line after stratifying by BB
dat %>%
group_by(BB_strata) %>%
summarize(slope = cor(HR_per_game, R_per_game)*sd(R_per_game)/sd(HR_per_game))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## BB_strata slope
## <dbl> <dbl>
## 1 2.8 1.52
## 2 2.9 1.57
## 3 3 1.52
## 4 3.1 1.49
## 5 3.2 1.58
## 6 3.3 1.56
## 7 3.4 1.48
## 8 3.5 1.63
## 9 3.6 1.83
## 10 3.7 1.45
## 11 3.8 1.70
## 12 3.9 1.30
The textbook for this section is available here
Key points
> lm(son ~ father, data = galton_heights)
Call:
lm(formula = son ~ father, data = galton_heights)
Coefficients:
(Intercept) father
35.71 0.50
Interpret the numeric coefficient for “father.”
galton_heights <- galton_heights %>%
mutate(father_centered=father - mean(father))
We run a linear model using this centered fathers’ height variable.
> lm(son ~ father_centered, data = galton_heights)
Call:
lm(formula = son ~ father_centered, data = galton_heights)
Coefficients:
(Intercept) father_centered
70.45 0.50
Interpret the numeric coefficient for the intercept.
beta1 = seq(0, 1, len=nrow(galton_heights))
results <- data.frame(beta1 = beta1,
rss = sapply(beta1, rss, beta0 = 25))
results %>% ggplot(aes(beta1, rss)) + geom_line() +
geom_line(aes(beta1, rss), col=2)
In a model for sons’ heights vs fathers’ heights, what is the least squares estimate (LSE) for \(\beta_1\) if we assume \(\hat{\beta}_{0}\) is 36?
What is the coefficient for bases on balls?
B <- 1000
N <- 100
lse <- replicate(B, {
sample_n(galton_heights, N, replace = TRUE) %>%
lm(son ~ father, data = .) %>% .$coef
})
lse <- data.frame(beta_0 = lse[1,], beta_1 = lse[2,])
What does the central limit theorem tell us about the variables beta_0 and beta_1?
$\beta_0 $
> mod <- lm(son ~ father, data = galton_heights)
> summary(mod)
Call:
lm(formula = son ~ father, data = galton_heights)
Residuals:
Min 1Q Median 3Q Max
-5.902 -1.405 0.092 1.342 8.092
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.7125 4.5174 7.91 2.8e-13 ***
father 0.5028 0.0653 7.70 9.5e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
$\beta_0 $
What null hypothesis is the second p-value (the one in the father row) testing?
galton_heights %>% ggplot(aes(father, son)) +
geom_point() +
geom_smooth()
galton_heights %>% ggplot(aes(father, son)) +
geom_point() +
geom_smooth(method = "lm")
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model, interval = c("confidence"), level = 0.95)
data <- as.tibble(predictions) %>% bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_ribbon(aes(ymin=lwr, ymax=upr), alpha=0.2) +
geom_point(data = galton_heights, aes(x = father, y = son))
model <- lm(son ~ father, data = galton_heights)
predictions <- predict(model)
data <- as.tibble(predictions) %>% bind_cols(father = galton_heights$father)
ggplot(data, aes(x = father, y = fit)) +
geom_line(color = "blue", size = 1) +
geom_point(data = galton_heights, aes(x = father, y = son))
lm function does not know how to handle grouped tibbles.lm function cannot be put into a tidy format.You’ve already written the function get_slope, shown below.
get_slope <- function(data) {
fit <- lm(R ~ BB, data = data)
sum.fit <- summary(fit)
data.frame(slope = sum.fit$coefficients[2, "Estimate"],
se = sum.fit$coefficients[2, "Std. Error"],
pvalue = sum.fit$coefficients[2, "Pr(>|t|)"])
}
What additional code could you write to accomplish your goal?
dat %>%
group_by(HR) %>%
do(get_slope)
dat %>%
group_by(HR) %>%
do(get_slope(.))
dat %>%
group_by(HR) %>%
do(slope = get_slope(.))
dat %>%
do(get_slope(.))
dat <- Teams %>% filter(yearID %in% 1961:2001) %>%
mutate(HR = HR/G,
R = R/G) %>%
select(lgID, HR, BB, R)
What code would help you quickly answer this question?
dat %>%
group_by(lgID) %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
dat %>%
group_by(lgID) %>%
do(glance(lm(R ~ HR, data = .)))
dat %>%
do(tidy(lm(R ~ HR, data = .), conf.int = T)) %>%
filter(term == "HR")
dat %>%
group_by(lgID) %>%
do(mod = lm(R ~ HR, data = .))
lm(R ~ BB + HR)lm(HR ~ BB + singles + doubles + triples)lm(R ~ BB + singles + doubles + triples + HR)lm(R ~ singles + doubles + triples + HR)Look at the code from the video for a hint:
pa_per_game <- Batting %>%
filter(yearID == 2002) %>%
group_by(teamID) %>%
summarize(pa_per_game = sum(AB+BB)/max(G)) %>%
.$pa_per_game %>%
mean
pa_per_game: the mean number of plate appearances per team per game for each teampa_per_game: the mean number of plate appearances per game for each playerpa_per_game: the number of plate appearances per team per game, averaged across all teamsWhich team scores more runs, as predicted by our model?
In the Confounding section, you will learn what is perhaps the most important lesson of statistics: that correlation is not causation.
After completing this section, you will be able to:
This section has one part: Correlation is Not Causation.
The textbook for this section is available here
How many of these correlations would you expect to have a significant p-value (p>0.05), just by chance?